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Abstract 


Based on the collected multiwavelength data, namely in the radio (NVSS, FIRST, RATAN-600), IR (WISE), 
optical (Pan-STARRS), UV (GALEX), and X-ray (ROSAT, Swift-XRT) ranges, we have performed a cluster 
analysis for the blazars of the Roma-BZCAT catalog. Using two machine learning methods, namely a combination 
of PCA with k-means clustering and Kohonen’s self-organizing maps (SOMs), we have constructed an 
independent classification of the blazars (five classes) and compared the classes with the known Roma-BZCAT 
classification (FSRQs, BL Lacs, galaxy-dominated BL Lacs, and blazars of an uncertain type) as well as with the 
high synchrotron peaked (HSP) blazars from the 3HSP catalog and blazars from the TeVCat catalog. The obtained 
groups demonstrate concordance with the BL Lac/FSRQ classification along with a continuous character of the 
change in the properties. The group of HSP blazars stands out against the overall distribution. We examine the 
characteristics of the five groups and demonstrate distinctions in their spectral energy distribution shapes. The 
effectiveness of the clustering technique for objective analysis of multiparametric arrays of experimental data is 


demonstrated. 
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1. Introduction 


Blazars are a rare type of active galactic nuclei (AGNs) with 
a jet of relativistic plasma pointing toward the Earth at 
relatively small angle (e.g., Urry & Padovani 1995; Blandford 
et al. 2019). Blazars are also among the brightest AGNs, and 
the Doppler-beaming effect (Madau et al. 1987; Ghisellini et al. 
1993; Fan et al. 2017) makes their jet emission even more 
boosted and visible up to z ~ 6 (Belladitta et al. 2020). They are 
characterized by complex properties such as extreme variability 
at all wavelengths, high luminosity, high degree of polariza- 
tion, and brightness temperatures exceeding the Compton limit 
(Urry 1999). 

Blazars are the dominant sources in the extragalactic gamma- 
ray sky. Because of the relativistic amplification of their 
emission, sources even at high redshifts are observed. The 
investigation of the multiwavelength properties of high redshift 
blazars is especially important as they are the most powerful 
non-explosive astrophysical sources and their study can be 
crucial for understanding the jet formation and propagation 
around supermassive black holes. Recently found connection 
between blazars and IceCube sources of high-energy neutrinos 
(Plavin et al. 2020) also adds to the topicality of their 
investigation. 

The typical spectral energy distribution (SED) of a blazar is 
dominated by the non-thermal radiation from the jet and 


consists of the synchrotron (peaking between the far-infrared 
and the soft X-ray bands) and inverse-Compton (peaking in the 
hard X-ray to gamma-ray bands) humps (Abdo et al. 2010). 
Besides that, the SED of a blazar can also feature the thermal 
radiation from the host galaxy (infrared (IR) hump or stellar 
emission) and the emission from the accretion disk around the 
central black hole (“blue hump”) and from the broad line region 
(Giommi et al. 2012b). Blazars exhibit large and rapid 
variations on a variety of timescales from years to intervals 
even shorter than an hour (e.g., Padovani et al. 2017 and 
references therein). 

Blazars are subclassified as flat-spectrum radio quasars 
(FSRQs) and BL Lacerta-type objects (BL Lacs) based on their 
optical spectra: FSRQs show broad emission lines, while BL 
Lacs display either very weak emission lines or even are 
completely featureless (e.g., Urry & Padovani 1995; Falomo 
et al. 2014). Another classification was proposed based on the 
luminosity of the broad-line region (BLR) in Eddington 
luminosity (Ghisellini et al. 2011): sources with Lgtr/Leaa 
higher or lower than 5 x 1074 were classified as FSRQs or BL 
Lacs, respectively, according to a transition of the accretion 
regime from radiatively efficient to an inefficient one between 
the two classes. 

Based on the peak frequency (Vpeak) of the synchrotron energy 
hump, blazars are usually subclassified as low (LSP, Vpeak < 
10'* Hz), intermediate (ISP, 10'* Hz <Vpeak < 10'° Hz), or high- 
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synchrotron peaked (HSP, Upeax > 105 Hz) blazars (Abdo et al. 
2010; Fan et al. 2016). Most HSP and ISP blazars have been 
classified as BL Lacs, while the LSP class contains both FSRQs 
and LSP BL Lacs (Böttcher 2019; Prandini & Ghisellini 2022). 

Inspired by the observed data, alternative physical categor- 
izations for blazars are proposed: for instance, based on the 
sources with intrinsically weak or strong O II and O M emission 
lines (Landt et al. 2004); based on the different accretion rates 
(the luminosity of the broad line region relative to the 
Eddington luminosity) of the two subclasses of blazars (e.g., 
Ghisellini et al. 2011; Sbarrato et al. 2012); based on the 
ionizing radiation emitted from the accretion disk (Giommi 
et al. 2012a; Giommi et al. 2013; Giommi & Padovani 2015); 
based on the kinematic features of radio jets (e.g., Hervet et al. 
2016); etc. 

The above mentioned numerous approaches to blazar 
classification tend to use a single categorical (presence/absence 
of emission lines) or a single numerical parameter (HSPs, 
Lprr/Leaa), in the latter case also categorized by setting a 
threshold defined by the researchers. At the same time blazars, 
like any objects, have numerous measurable characteristics that 
define their properties, and contemporary computing power and 
machine learning (ML) methods allow us to investigate a large 
number of characteristics in all their complexity. 

In this paper we perform multiparametric cluster analysis for 
the Roma-BZCAT catalog (Massaro et al. 2015), a sample of 
blazars with the most complete set of characteristics observed 
in different ranges of the electromagnetic spectrum. The aim is 
to divide the blazars into groups with similar properties to 
further analyze the differences between the groups to check the 
performance of the ML (clustering) methodology and compare 
it with the generally accepted classification approaches. 


2. General Conception of Cluster Analysis 


Cluster analysis, or clustering, is a classical problem of 
unsupervised ML, i.e., learning with unlabelled data, when the 
model is not given in advance any target variable, in this case 
the classes of considered objects. The aim of the clustering 
model is to combine similar objects in groups (clusters) based 
on the similarity of their characteristics, or features. The 
principal idea is that when these characteristics are expressed 
numerically, the objects with similar properties are located 
closer to each other in the feature space than those with greater 
differences. In the simplest case of two-three features and 
clearly separated clusters this problem can be solved visually 
by constructing usual two-dimensional (2D) or three-dimen- 
sional (3D) scatter diagrams. In the general case of an arbitrary 
number of characteristics, the clustering must be performed in 
an n-dimensional feature space. ML algorithms are capable of 
solving such problems successfully even for complex 
distributions. 
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Note that clustering in terms of ML should be distinguished 
from classification, which is a separate problem of supervised 
ML, when the model is trained to guess the classes known 
a priori. The main difference between the unsupervised 
problem of clustering and the supervised or semi-supervised 
problem of classification is the approach itself: while in the 
latter case we exploit a known classification developed by 
some other methods and assign the known classes to new 
objects, in clustering we develop a new classification based 
solely on the data collected for the objects. This allows one to 
describe a sample based on experimental data, avoiding as 
much as possible the subjective approach to the division of 
objects into different types. 

The mathematical formulation of the clustering is as follows. 
Let {X} € RYM be a set with dimension N x M, where N is 
the number of objects x,, and M is the number of their features 
Xm- The set {X} can be represented as a matrix X = (X„m) with 
n=1, 2,...N=[1, N], m=[1, M]. Let {Y} € Z be a set of 
cardinality K, where K is the number of clusters, {Y} = [1, K]. 
The solution of the clustering problem is finding an algorithmic 
function a: {X} — {Y} that assigns a singular label y,, k= [1, 
K] to each object x„, n = [1, N] in such a way that the objects 
with similar properties (ideally forming separated groups in the 
feature space xm) correspond to the same label (cluster). Each 
object x, can be represented in the feature space as a vector of 
dimension M: xX, = (Xn1;----Xnm). The measure of object 
similarity is a metric of distance between the vectors, in our 
case this is the Euclidean distance 


d= |x; — xl, (1) 


where x; and x; are any two vectors x, (sample objects). In 
order for the features to have equal priorities in the clustering 
process, they should be normalized beforehand to the same 
scale. 

We should note that cluster analysis is a heuristic and its 
exact results are always model dependent both on the choice of 
the features selected for the modeling and on the clustering 
algorithm a: X — Y. An additional degree of freedom is the 
number of clusters K, which in most algorithms is defined 
a priori and as well evaluated heuristically based on the data. In 
this sense the obtained structure of the clusters should not be 
considered as an established natural phenomenon, especially 
when the clusters are not well separated; the cluster analysis, to 
a greater degree, is an instrument to search for patterns in the 
sample rather than investigation of individual objects. 

Generally, the problem solution can be divided into several 
Stages: 


1. data collection and feature engineering; 

2. selection of characteristics for the model feature space; 

3. clustering with different algorithms in searching for a 
model with the best quality metrics; 
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4. interpretation of the result, i.e., analysis of the difference 
between objects in different groups. 


Further in the paper we successively consider the mentioned 
stages. 


3. Initial Data 


In this section, we describe the databases and characteristics 
that have been used to compile the dataset of this project. Some 
of the characteristics are not directly used for the clustering 
because they are available for only a small number of blazars, 
nevertheless they are useful for subsequent analysis. 

The basis for our dataset is the 5th edition of the Roma- 
BZCAT catalog of blazars (Massaro et al. 2009, 2015). The 
catalog contains a list of 3561 AGNs which are classified by 
the authors as blazars based on their observed properties. The 
following information is available: coordinates; redshifts; 
optical magnitudes in the R band from USNO-B1.0, r filter 
from Sloan Digital Sky Survey (SDSS) DR10, or in other filters 
when these data are absent; 1.4GHz (NVSS, FIRST, 21 cm,) 
and 143 GHz (Planck, 2.1mm) radio flux densities; X-ray 
(ROSAT, Swift-XRT) and gamma-ray (Fermi-LAT) fluxes as 
well as the radio-to-optical spectral index characterizing the 
ratio between the radio and optical emission. Note that the R 
band magnitude presented in Roma-BZCAT describes the 
optical radiation rather loosely, being sometimes obtained from 
different photometric filters. For this reason we used more 
consistent data on optical magnitudes from other catalogs. 

Based on the BLcatt RATAN-600 measurements at 
frequencies 1-22 GHz covering the period of observations 
2006-2022 (Mingaliev et al. 2014; Sotnikova et al. 2022) and 
the CATS database” (Verkhodanov et al. 1997, 2005), we 
calculated the averaged spectral flux density at a frequency of 
5 GHz, radio spectral indices, and radio variability. The 
averaged spectral indices a were calculated for the 1-2, 2-5, 
5-8, 5-11, 8-11, 11-22, 8-22, and 5-22 GHz ranges. The 
radio variability is given at frequencies of 1, 2, 5, 8, 11, and 
22 GHz. The variability index was calculated using the formula 
from Aller et al. (1992) 


V= (Smax — OS max) = (Smin Ex TS min) i (2) 


(Smax — TS mnax) + (Smin + TS min) 


where Smax, Smin are the maximum and minimum flux densities, 
respectively, and os a> ISa are their standard errors. 

The IR measurements are represented by data from the 
Wide-field Infrared Survey Explorer (WISE) in the W1, W2, 
W3, W4 bands (3.4, 4.6, 12, 22 um) and by the Two Micron 
All-Sky Survey (2MASS) data in the JHK bands (1.25, 1.65, 
2.2 um). The data were taken from the NASA/IPAC Infrared 


4 https: //www.sao.ru/blcat/ 
? https: //www.sao.ru/cats/ 
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Science Archive® using the pyvo Python library,’ identification 
of the blazars was carried out by coordinates using the cone 
search query in the 9” field of view (the WISE angular resolution 
is about 6”). 

The optical range is represented by the Pan-STARRS® 
measurements in the grizy filters (effective wavelengths of 
4810, 6170, 7520, 8660, and 9620 A). The data were obtained 
by a standard request to the archive with a list of object 
coordinates. Additionally, for the optical range we downloaded 
the SDSS DR17? data in the ugriz filters (effective wavelengths 
3557, 4702, 6175, 7491, and 8946 A) via the provided web 
form (the requests were made automatically using a script). 

The ultraviolet (UV) range is represented by the GALEX 
FUV and NUV channels (effective wavelengths of 1538.6 and 
2315.7A respectively). The data were obtained from the 
Mikulski Archive for Space Telescopes (MAST) using the 
astroquery Python library. A UV counterpart was identi- 
fied as the object closest to the coordinates in the 7”2 field of 
view (GALEX angular resolution is 4”). 

In all the above cases, if there were several objects in the 
field of view of a search query, we chose the closest by angular 
distance. The possible presence of outliers was additionally 
controlled by histograms of angular distance deviations from 
the blazar coordinates. Since the identification was carried out 
in automatic mode, we cannot completely exclude incorrect 
identifications in some cases, but since we took the minimum 
possible search radius, comparable to the resolution of the 
instruments, such cases should be rare and can be discarded 
during subsequent data cleansing. 

The extinction was determined using NED’s Coordinate and 
Galactic Extinction Calculator.'° For the GALEX FUV and 
NUV channels, we used the extinction law from Fitzpatrick 
(1999), the calculations were made with the extinction 
Python library.'' For the WISE IR range, the extinction was 
considered zero. 

To determine the peak frequency of the synchrotron 
component, we used SEDs obtained from the SED Builder'* 
tool of the Italian Space Agency (ASI) Space Science Data 
Center (SSDC). The measurements were downloaded using 
Selenium WebDriver, * which allows interaction with web 
sites in an automated mode. 

Thus, in the initial dataset we managed to collect a fairly 
extensive set of observed data in various ranges of the 
electromagnetic spectrum: from radio to gamma emission. 
The dataset also includes information about redshift, spectral 
6 https://irsa.ipac.caltech.edu 
https: //pypi.org/project/pyvo/ 
https: //outerspace.stsci.edu /display /PANSTARRS / 
https://skyserver.sdss.org/dr17/ 
10 http: //ned.ipac.caltech.edu/forms /calculator.html 
' https: //extinction.readthedocs.io/en /latest / 
12 https: / /tools.ssdc.asi.it/SED/ 

https: //www.selenium.dev 
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Figure 1. Logarithm of radio luminosity (at 5 GHz) vs. redshift (left) and comoving distance (right). 


indices, and estimates of variability in the radio range. In the 
next section, we describe additional processing of the derived 
data in order to extract more informative features and present a 
complete dataset of the obtained and calculated characteristics. 


4. Calculation of Blazars’ Characteristics 


The initial characteristics obtained from catalogs often 
cannot be directly used in the model, because they might 
describe not only the properties of the objects but also other 
factors affecting the result. For example, the magnitudes 
depend on the photometric system of a particular catalog. 
Therefore, to solve our problem, we should obtain such 
characteristics that in the best possible way describe the 
physical properties of the blazars. 

Since blazars are located at different cosmological distances, 
these characteristics should be related to the rest frame of an 
object. Unfortunately, this is often impossible in practice. For 
instance, to estimate the luminosity at a certain frequency at 
cosmological distances, a good description of the SED shape is 
necessary, but for most of the blazars we have only point 
estimates of this shape at a number of frequencies; moreover, 
the SED can be variable. Empirical analytical dependencies 
(see, e.g., Chilingarian et al. 2010) work only at small 
redshifts z < 0.5. 

Here we will use the rest frame characteristics where 
possible, the remaining features will be given in the observer’s 
frame of reference, and the distance to a blazar will also be set 
as one of the parameters. In Section 5 we describe how this 
might affect the results in more detail. 

The distance to a blazar could be described directly by the 
redshift, but in this scale the distance distribution of the blazars 
appears rather crowded and uneven. In Figure 1 the 
dependences of the monochromatic radio luminosity on the 
redshift and on the comoving distance are presented. The 
comoving distance scale is more suitable for the modeling, the 


distances were determined from the redshifts using the 
astropy library (Astropy Collaboration et al. 2022) and 
based on the ACDM cosmology with the Planck Collaboration 
parameters (Planck Collaboration et al. 2016): Hyp = 
67.74kms_' Mpc’, Qm = 0.3089, Q, = 0.6911. Along with 
this, some other parameters were calculated: luminosity 
distance, distance modulus, lookback distance, and the 
Universe’s age in the blazar rest frame at the time of light 
emission. The monochromatic (5 GHz) radio luminosity was 
estimated by the formula 


Ls = 4nD2S5s(1 + 2-27}, (3) 


where Dz is the luminosity distance, Ss is the flux density at 
5 GHz, z is the redshift, and œ is the averaged spectral index 
taken for 5-11 GHz or, where the measurements were absent, 
for 5-8 GHz. 

The dependence of radio luminosity on distance in Figure 1 
is completely or partially caused by the selection effects. One 
of them is the Malmquist bias (e.g., Butkevich et al. 2005): 
roughly speaking, if we assume a normal luminosity distribu- 
tion, the same at all distances d, then with increasing distance 
the detection limit shifts to higher luminosities, which reduces 
the number of objects in the left wing of the distribution, and 
simultaneously the expected number of high-luminosity objects 
grows with increasing area of the sphere of radius d, which 
increases the probability of detecting bright objects in the right 
wing of the distribution. Since both effects are proportional to 
d’, a linear increase in average luminosity with distance should 
be observed. It must be noted that the observed dependence is 
not linear at distances S2000 Mpc; however, selection cannot 
be excluded in this case either, since the measurements were 
obtained in different surveys, and the interest to the nearest but 
not necessarily luminous objects is natural. 

The optical flux densities in the observer’s frame of 
reference were calculated based on the AB magnitude system 
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Table 1 
Parameters of the Magnitude-to-flux-density Transformations 


Band Effective Wavelength Zero Magnitude 
A =c/v) Flux Density F,o, Jy 

GALEX FUV 1538.6 A 3631 
GALEX NUV 2315.7A 3631 
SDSS u 3557 A 3631 
SDSS g 4702 A 3631 
SDSS r 6175 A 3631 
SDSS i 7491 A 3631 
SDSS z 8946 A 3631 
Pan-STARRS g 4810 A 3631 
Pan-STARRS r 6170 A 3631 
Pan-STARRS i 7520 A 3631 
Pan-STARRS z 8660 A 3631 
Pan-STARRS y 9620 A 3631 
2MASS J 1.235 um 1594 
2MASS H 1.662 um 1024 
2MASS K 2.159 um 666.8 
WISE W1 3.3526 um 309.54 
WISE W2 4.6028 um 171.787 
WISE W3 11.5608 um 31.674 
WISE W4 22.0883 um 8.363 
as 

logo 0E) = logy (VF) — 2E, (4) 


2.5 


where v is the effective frequency of a photometric band, Fo is 
the flux density from zero magnitude, and m, and E, are the 
magnitude and extinction in the photometric band (extinction 
according to the NED data) respectively. The v and F, values 
were taken from the descriptions of corresponding photometric 
systems (Cohen et al. 2003; Morrissey et al. 2007; Jarrett et al. 
2011; Tonry et al. 2012). For the SDSS we implemented the 
corrections u = u — 0.04, z=z+0.02, according to accepted 
practice. Summary data are given in Table 1. 

Based on the available stellar magnitudes and extinctions, 
we calculated optical colors in various photometric systems. 
However, their use in the clustering seems impractical. In 
Figure 2, SEDs for five random blazars are presented, the 
points mark flux densities log,, vF, in the WISE, 2MASS, Pan- 
STARRS, and GALEX passbands; for better visualization of 
individual SEDs the points are connected by lines. Flux 
densities are related to stellar magnitudes, and the difference 
between the pairs of the points generally represents the optical 
colors. It can be seen that in the frequency range of each 
individual instrument, for example WISE and Pan-STARRS 
whose data we further use in the clustering, a predominant 
slope of the spectrum can be assigned for a particular blazar, 
while the ratios between flux densities for the pairs of points 
(“colors”) can sometimes differ significantly. Thus, the use of 
colors in the clustering can create additional noise that would 
not allow the algorithm to estimate the predominant slope of 


Kudryavtsev et al. 


—11.75 4 


—12.00 4 


=2 s71] 


—12.25 4 


E 
U -12.504 
D 
g 
© -12.754 : 
ue ; ; ; 
5 —13.00 : : : 
[=] i F H 
D -13.25 | ; ; a s 
E : : Di 
-13.50] bea 
WISE i 2MASS i Pan-STARRS : GALEX , 
-13.75 r — — — 
13.5 14.0 14.5 15.0 
logiov, [Hz] 


Figure 2. SEDs for five randomly selected blazars, constructed based on the 
data from the WISE, 2MASS, Pan-STARRS, and GALEX catalogs. 


the spectrum. For this reason, we calculated special features: 
tangents of the spectrum slope in the WISE and Pan-STARRS 
ranges (the slopes for 2MASS and GALEX were not used due 
to lack of data). The spectrum slopes were approximated by a 
linear dependence using the method of least squares. 

To determine the frequency of the synchrotron peak, we used 
flux densities downloaded from the ASI SSDC SED Builder. 
The position of the peak was determined by approximating a 
SED with polynomials of the third or, in some cases, second 
degree. The flux density measurements were represented by 
the data from the SSDC resident catalogs and other catalogs. 
To calculate the parameters of the polynomial, the program 
used only the resident catalogs, while we visually controlled 
the obtained result using all the measurements (see Figure 3). 
The frequencies were transformed to the source’s rest frame 


Vpeak = Vpeak,obs (1 F z): (5) 


A graph of the obtained values versus comoving distance is 
shown in Figure 4. 

The estimates of optical variability were calculated from the 
minimum and maximum point-spread function (PSF) magni- 
tudes presented in the Pan-STARRS data for each of the five 
filters (grizy). Our variability estimates are simple differences 
between these values for all the blazars with two or more 
observing epochs (according to the number of the measure- 
ments included in the mean PSF magnitude from the detections 
in a corresponding filter). The estimates are rough, as they 
depend on the time when the observing epochs have been 
carried out and on their number. As an example, the 
distribution of the number of observations in the i filter is 
shown in Figure 5. 

Roma-BZCAT presents X-ray fluxes for the range 
0.1-2.4keV (5-124 A), we recalculated them into the loga- 
rithmic scale logjo[erg cm” s']. Gamma radiation is 


14 https: / /github.com/DKudryavtsev /BZCAT-SED-Viewer 
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Figure 3. An example of SED fitting with a cubic polynomial to find the peak of the synchrotron component. 
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Figure 4. Frequencies of synchrotron peaks depending on comoving distance. 


represented in Roma-BZCAT in photons cm`?s™! for the 
range 1-100 GeV, we also recalculated the values into the scale 
logiolergcm *s~'], taking the middle of the range as the 
photon energy: 50 GeV. 

In addition to the Roma-BZCAT radio-to-optical spectral 
index, which characterizes the ratio between the radio and 
optical fluxes, we calculated other parameters that describe flux 
ratios at different frequencies of the electromagnetic spectrum 
(let us call these parameters the “hardnesses’”). They are 
calculated as decimal logarithms of the ratios between flux 
densities at the frequencies VF: .4 Gy, (radio), vFw2 (IR), vF; 
(optical), vFx (X-rays), and vF, (gamma-rays); e.g., the IR/ 
optical hardness is log,)(vFyw2 / VF) = logy)(vFw2) — log, (VF). 
The clustering model uses six such ratios, because not enough 
data are available for all the frequencies. 

The complete list of the parameters available in the final 
dataset includes more than 100 items. The dataset is 
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Figure 5. Distribution of the number of observing epochs for the Pan-STARRS 
i filter (“iMeanPSFMagNpt” in the catalog). 


schematically presented in Figure 6 and is published in the 
VizieR database. 


5. Feature Space and Model Data Set 


The immediate use of an entire dataset by ML algorithms is 
impossible. First of all, the data must be appropriately 
preprocessed to obtain meaningful results. Moreover, some 
features may have a large amount of missing data, some 
strongly correlate with each other, others are auxiliary and are 
not related to the actual properties of the objects (e.g., 
extinction), and some should be discarded since they do not 
affect the final result but increase the dimensionality. The best 
practice also is when we could provide the ML model not only 
with the direct characteristics of the objects but also with ML- 
specific features, which are combinations or transformations of 
actual characteristics that help the model to “understand” data 
better. Therefore, the next step after collecting the data is the 
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Figure 6. All dataset characteristics. The data are presented in the VizieR database. 


construction of the model dataset that will be directly used by 
the ML algorithms. This includes the selection of character- 
istics relevant to the problem, feature engineering and 
transformations, data cleansing, imputation of missing values, 
scaling, etc. 

Note that while this model dataset is constructed to be used 
by ML algorithms and undergoes certain transformations 
during this process, the predictions that a trained model 


produces in the end, such as the cluster label in our case, are 
object-specific. In other words, having cluster labels (member- 
ship) for the blazars as a result of our clustering, we can further 
analyze any other characteristics of the blazars from the 
original or model dataset along with even the new ones. 

The choice of the features to form the feature space of the 
clustering can be made using various approaches. In our case 
we used as many available blazar characteristics as possible. In 
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general, if there is not a predetermined scope of investigation, 
i.e., the study is not aimed at revealing relationships between 
some specific characteristics selected beforehand, this approach 
allows us to describe the objects under investigation most 
completely, form the groups of similar objects without a priori 
assumptions, and increase the reproducibility of the clustering 
results. 

Here we consider in detail the preparation of the model 
dataset to be directly used by the clustering algorithms. 


5.1. Dropping Unnecessary Characteristics 


Different ways of describing cosmological distances (the 
“Cosmology” cell in Figure 6) are a priori related by analytical 
dependencies and do not give new information to the model, 
therefore only one of them, the comoving distance, was chosen, 
as the distribution of blazars in this scale is most uniform (see 
Figure | above). 

Stellar magnitudes and flux densities are also analytically 
related. For the model dataset the latter were selected since they 
do not depend on the photometric system and are more directly 
connected to physical properties: the luminosity of an object 
and the distance to it. 

The R.A., decl. coordinates were excluded as we do not 
expect heterogeneity here and also because the spherical 
coordinate system (full circle in R.A., semicircle in decl.) leads 
to an artificial global structure in the data. 

The blazar types according to the Roma-BZCAT classification 
(BL Lacs, FSRQs, etc.) are a categorical feature, which in 
combination with other characteristics having continuous dis- 
tributions leads to a trivial solution: division into the known types. 
Therefore, this information was removed from the model dataset. 

As noted above, instead of the colors we used more 
smoothed parameters: tangents of the spectrum slope in the 
WISE and Pan-STARRS passbands. For this reason, colors 
were not considered in modeling. Other photometric passbands 
were not included due to lack of data (see below). 

Unfortunately, we had to exclude from the model dataset the 
data on radio and optical variability: these values significantly 
correlate with the number of observations (Tornikoski et al. 
2000; Nieppola et al. 2007; Khabibullina et al. 2023), which in 
our case generated an artificial cluster of radio variability: the 
differences were clearly visible at the most observed frequency 
of 5 GHz, while at other frequencies with fewer measurements 
the cluster was not that distinguished. 

We also excluded the radio spectral indices. The modeling 
showed that their distributions for the groups found in 
clustering had not had significant differences, therefore the 
final model was built without them. 

Extinctions were dropped as these are auxiliary data used in 
flux density calculations. 
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5.2. Dropping Characteristics with Many Missing Values 


A separate problem is missing values: almost all characteristics, 
to a greater or lesser extent, are subjected to lack of data. The 
processing of missing values is covered further in more detail, but 
characteristics with a very large number of absent measurements 
cannot be used in modeling. According to accepted empirical 
practice, we excluded features with more than 40% of missing 
data: in particular, the GALEX FUV values and associated 
characteristics, SDSS and 2MASS data, Roma-BZCAT (Fermi) 
data on gamma-ray fluxes, and data on the radio flux densities at 
143 GHz (1.4 and 5 GHz have remained). 

We once again note that characteristics excluded from the 
modeling can be used for analysis after the clustering. 


5.3. Outliers 


Measurements outstanding significantly from the distribution of 
a characteristic can distort results of most clustering algorithms. 
We considered the distributions of the features selected for the 
modeling and visually evaluated their boundaries. Blazars outside 
designated distribution boundaries were excluded from the model 
data set (not more than several objects for a feature, a total of 34 
objects), and their classification into groups was carried out after 
the clustering using a separately trained k-nearest neighbors 
(KNN) model (Section 7.2.2). 


5.4. Multicollinearity Combining Similar Characteristics 
into Meta-features 


The initial data contain sets of characteristics that naturally 
correlate with each other: the flux densities at different 
frequencies and part of the flux density ratios (hardnesses). 
Standard techniques in processing multicollinear features in 
ML are either their exclusion or Principal Component Analysis 
(PCA): transformation of the features by linear algebra methods 
into new mutually orthogonal (zero correlation) characteristics 
oriented in the feature space along the axes of the greatest 
variance. 

In this paper we used PCA to combine a number of flux 
densities at different frequencies into one meta-feature, using 
for that the first principal component. The sets of input 
characteristics and their corresponding meta-features are shown 
in Table 2. The choice of these two meta-features is based on 
the simple core—jet model of AGNs, where the radio emission 
is unambiguously related to the synchrotron radiation from the 
jet, while emission in other electromagnetic ranges can be 
generated by both the core regions and the jet. 

Note that for an individual blazar, measurements for some 
input characteristics may be missing. In such cases we imputed 
the missing values using probabilistic PCA (PPCA, Tipping & 
Bishop 1999). The PPCA implementation’ from Porta et al. 


15 https: //github.com/el-hult/pyppca 
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Table 2 
Sets of Similar (Correlated) Physical Characteristics and their Meta-features in 
the Model Dataset 


Characteristics Meta-feature 

X-ray, GALEX NUV, Pan-STARRS grizy, “Short-wavelength” flux, fir_x 
and WISE W1-W4 flux densities 

Flux densities at 1.4 and 5 GHz Radio flux, fradio 


No data 


(2005) was adopted. The method is considered in more detail in 
Section 7.2.1. We applied PPCA separately for each set of 
characteristics from Table 2. If all the corresponding values for 
an object were missing, we left them empty (at this first stage). 
In contrast to the above mentioned, for the hardnesses, some 
of which may also correlate with each other, it is important to 
preserve the information about differences in flux densities at 
various ranges of the electromagnetic spectrum. In this case, 
the multicollinearity was removed later during dimensionality 
reduction of the entire model dataset, also using PCA but 


taking more principal components (see Section 7.1.1). 


Data 


5.5. Scaling 
To ensure equal priority of features in PCA and clustering, 
all of them must be expressed in a unified numerical scale. At 
all the stages where that was necessary, we used the scikit-learn 
(Pedregosa et al. 2011) standard scaler, which produces the 
zero mean and unit variance for a feature. 


IR/x 


fia x 
f: adio 


x as 
S Q x 
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5.6. Model Data Set 


The clustering model dataset after data cleansing and feature 
transformations includes 14 features. The heatmap of the 
dataset is shown in Figure 7. The columns of the table are 
designated along the x axis, the y axis corresponds to its rows, al 2) 
in which individual object vectors are located. The heatmap Figure 7. Model dataset heatmap. The rows (y-axis) correspond to individual 

objects. The columns (x-axis) are the features. Missing data are shown by the 


shows missing data (less than 40% in each of the columns). 

light color. 
6. Selection and other Hindering Effects 
classes that naturally demonstrate different distance distribu- 
tions (because of the selection in the data or not); they also 
contribute to a more accurate probabilistic imputation of 
missing values (see Section 7.2.1). It is for these reasons that 
we leave the comoving distance and raw flux densities as 
model characteristics. The dependence on distance must be 


kept in mind during further analysis of the obtained groups. 
The second nuisance is the fact that blazars are variable 


sources. In our dataset we took the average characteristics of 
the objects in the way they are presented in most catalogs. In 
some cases different characteristics may be measured in 
different states of blazar activity (active/quiescent) (see, e.g., 


blazars. 
At the same time, these effects can be considered useful for 
Raiteri et al. 2014). This restricts our results to only the groups’ 


the clustering because they could potentially help separate 


We should note that our feature space is subjected to some 
effects that are negative for interpretation of the results that 
could be obtained from the clustering. In the first place, all 
selection effects are preserved, and almost all characteristics are 
dependent on the distance to the blazars (or redshift z). For 
example, as has already been mentioned in Section 4, the 
redshift-corrected radio luminosity nevertheless shows strong 
dependence on z due to the Malmquist effect. Even the flux 
density ratios are dependent on the distance because of the 

cosmological rest frame drift, which could not be corrected due 

to the absence of an accurate SED model for each of the 
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Statistical properties, and any conclusion for an individual 
source must be treated with great caution. 

Finally, the BZCAT catalog is not a complete flux-limited 
list of blazars. Although the incompleteness of the sample still 
allows us to perform the clustering and analyze the observed 
differences, it could influence the distribution of blazars within 
the clusters, i.e., population of certain groups (boundaries of the 
clusters in the feature space) may change for a more complete 
sample. We evaluate the effect of data incompleteness in more 
detail in Section 7.3. 


7. Clustering 


The clustering was carried out first with a subsample of 
blazars that had no missing data in the model dataset (858 
blazars, ~24% of the BZCAT catalog) and then with the full 
sample and imputed missing data. As well, we tested various 
clustering algorithms: several from the scikit-learn library 
(Pedregosa et al. 2011) and Kohonen’s self-organizing maps 
(SOMs; Kohonen 2001; Wittek et al. 2017) based on a 
competitive neural network. 


7.1. The Subsample Without Missing Values 


7.1.1. Clustering with K-means and PCA Dimensionality 
Reduction 


We experimented with several clustering methods offered by 
the scikit-learn library (Pedregosa et al. 2011): k-means, 
Gaussian mixture, agglomerative clustering, and spectral 
clustering. The results were compared by the internal clustering 
validation metrics: the silhouette (Rousseeuw 1987), Calinski— 
Harabasz (Califski & Harabasz 1974), and Davies—Bouldin 
(Davies & Bouldin 1979) scores. As the final option with the 
best indicators, the combination of PCA dimensionality 
reduction with k-means clustering was chosen. For dimension- 
ality reduction, we took an explained variance of 90% as a 
criterion, thus converting the 14 model dataset features into six 
metafeatures: mutually orthogonal (uncorrelated) principal 
components. After that, k-means clustering (Arthur & Vassil- 
vitskii 2007) was performed in this 6D space. 

A comparison of clustering validation metrics calculated for 
various numbers of clusters shows that the data distribution is, 
not surprisingly, a continuous cloud without localized groups. 
Thus, the number of clusters may be determined pretty loosely. 
We selected the number of clusters based on the best (~90%) 
match of the clustering results for the subsample without 
missing values and for the full sample, see Section 7.2.1. With 
this approach, the optimal number of clusters turns out to be 
five. For comparison, the popular “elbow” method, taken as the 
first approximation and based on the analysis of decreasing 
distortion (average squared Euclidean distance from the 
centroids of the respective clusters), gives the number of 
clusters equal to four. 
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Principal Component Plot 


PC2 


PC; 


Figure 8. The PCA biplot with the projections of the dataset onto the first two 
primary components. The vector lengths correspond to the importance of the 
features for the clustering result. 


These results are further used as a baseline model for 
clustering the entire Roma-BZCAT catalog, and are also 
compared with the approach based on SOMs. The visualization 
of the obtained clusters is presented in Section 7.1.3 along with 
the SOMs. 

The PCA also allows us to estimate the significance of the 
features for the clustering. To this end, we constructed a PCA 
biplot (Figure 8) using the Yel lowbrick library (Bengfort & 
Bilbro 2019). The figure shows the projection of the dataset 
onto the plane of the two primary components, and the lengths 
of the vectors correspond to the importance of each feature, 
reflected by the magnitude of the corresponding values in the 
eigenvectors of the primary components. The directions of the 
vectors also demonstrate the degree of correlation (same 
direction) or anticorrelation (opposite direction) between the 
features. Numerically, the contributions are given in Table 3. 
Note that here we do not use the characteristics related to 
gamma-ray measurements because of the scarcity of data. The 
gamma-ray range, nevertheless, may be of great importance for 
blazar classification. We consider this problem in more detail in 
Section 7.3 and give there a similar table (Table 4), where the 
importance of the features is recalculated taking into account 
the gamma-ray emission. 


7.1.2. Self-organizing Maps 


Kohonen’s SOMs (Kohonen 2001) are a neural network with 
competitive learning, used for clustering and visualization of 
multiparametric data that can contain non-obvious patterns. In 
particular, SOMs solve the problem of projecting a multi- 
dimensional space into lower dimensions: onto a plane or into a 
3D space, where data vectors are grouped according to the 
degree of similarity of their parameters, which allows one to 
perform the clustering, e.g., to separate different populations of 
sources. 
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Table 3 
Importance of the Features (Without Gamma-ray Range) 


Feature Contribution, % 
IR/UV 10.9 
Spectrum slope (Pan-STARRS) 9.3 
IR/opt 8.7 
Radio/UV 8.0 
Distance 79 
logLs 79 
Radio/IR 7.6 
Radio/X 7.3 
Radio-to-opt. sp. index 7.1 
IR/X 6.5 
Jradio 6.3 
Spectrum slope (WISE) 6.1 
fir-x 3.8 
log Vpeak 2.6 
Total 100 


Table 4 
Importance of the Features When Taking into Account the Gamma-ray Range 


Feature Contribution, % 
Radio/gamma 7.7 
f 6.5 
Radio /IR 6.2 
UV/gamma 6.1 
Opt/gamma 5.6 
X/gamma 5.4 
logLs 5.2 
Radio-to-opt. sp. index 5.0 
Radio/X 5.0 
log L, 4.7 
Spectrum slope (WISE) 4.6 
Distance 4.6 
Radio/UV 4.5 
IR/gamma 4.4 
IR/UV 4.3 
fasio 4.1 
Spectrum slope (Pan-STARRS) 3.7 
IR/opt 3.6 
IR/X 3.3 
fir-x 3.1 
108 Vpeak 2.3 
Total 100 


There are many software packages for data analysis by the 
SOM method, written in different programming languages. In 
this study we chose a Python package somoclu’® (Wittek 
et al. 2017). 

In the SOM clustering, a grid of 200 x 320 output neurons 
was built with the number of weights for each neuron equal to 


16 https: //github.com/peter wittek/somoclu 
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the dimension of the input vector (our object). The SOM 
algorithm finds the Euclidean distance between the vectors in a 
multiparametric space (the parameters are scaled to the interval 
[0, 1]) and adjusts the weights of the neurons so that they 
would be structurally similar to the distribution of the input 
vectors in the feature space. In this way the input vectors 
(objects) become arranged on certain areas on the output 2D 
SOM map in such a way that objects with similar parameters 
are located close to each other. At the same time, the 
distribution of the neuron weight vectors in the feature space 
becomes close to the data distribution. In other words, after 
training the network we have an ordered 2D structure of 
neurons with the high-dimensional data topology encoded in 
their multidimensional weights. 

The final step is also k-means, but in this case we make it 
using the weights of the trained neurons and then labeling the 
objects according to the cluster label of their nearest neuron. 
The advantage of this method over the PCA dimensionality 
reduction is that it can restore possible nonlinearities in data 
distribution, while the PCA is a more straightforward and 
interpretable method of linear algebra. 


7.1.3. Cluster Visualization and Comparison of the Two 
Methods 


To visualize the clustering results we can use the 
t-distributed stochastic neighbor embedding (t-SNE) algorithm 
(van der Maaten & Hinton 2008) or the 2D coordinates derived 
in the SOM clustering. The t-SNE algorithm converts 
similarities between data points to joint probabilities and tries 
to minimize the Kullback—Leibler divergence (Kullback & 
Leibler 1951) between the joint probabilities in the low- 
dimensional embedding and the high-dimensional data. In the 
SOM approach the coordinates are obtained as a result of 
neural network mapping. The result of our PCA + k-means six- 
dimensional clustering embedded in a 2D space is shown in the 
left part of Figure 9 in the t-SNE (top) and SOM (bottom) 
coordinates, respectively. The right part of the figure is, 
accordingly, for the SOM clustering. 

We should note that t-SNE (top panels in Figure 9) is a 
nonlinear algorithm focusing on the local similarity of points, 
and the results also depend on the selection of hyperparameters 
(mainly, the perplexity), therefore t-SNE visualization cannot 
be interpreted as a precise description of object positions in the 
feature space. For example, the formation of apparently 
localized groups or some displacement of points belonging to 
different clusters can be of artificial nature. Nevertheless, the 
figure shows that the results of both the PCA + k-means (upper 
left) and SOM (upper right) clusterings are well described by 
the t-SNE visualization. Some separation of cluster 0 from the 
general cloud is clearly visible. 

The bottom panels are the same PCA + k-means and SOM 
clustering on the SOM 2D plane. 
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PCA+k-means clusters in t-SNE coordinates 
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SOM+k-means clusters in t-SNE coordinates 
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Figure 9. A comparison of the PCA+k-means (left) and SOM (right) clustering. The coordinates are the conditional 2D t-SNE coordinates (top) and the SOM plane 
(bottom). The points correspond to individual blazars. Different clusters are shown by different colors/symbols. 


Comparing the left and right panels in Figure 9, we can 
conclude that the two methods give similar results and, despite 
the absence of localized groups, the boundaries of the clusters 
are pretty much the same for both methods. 

To compare the results numerically, we used the Rand index 
(Rand 1971) calculated in the standard way 


2(a + b) 


ZNN- D i 


where a is number of objects that remained in the same cluster 
after new clustering, b is the number of objects that remained in 
different clusters, and N is the total number of compared pairs. 
The Rand index shows what percentage of objects does not 
change their cluster membership in the two clusterings. For the 
comparison of the PCA + k-means and SOM clusterings on the 
subsample without missing values, we have achieved a Rand 
index of 0.95. 

Additionally, we also tested the approach where t-SNE is 
used directly to reduce the dimensionality of data, followed by 
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clustering in the 2D or 3D space of t-SNE coordinates. This 
method gave good results for the sample without missing 
values; however, for the imputed data and the full sample the 
results were unsatisfactory. 


7.2. Full Sample 
7.2.1. Missing Data Imputation 


Clustering of the complete BZCAT catalog was carried out 
according to the same scheme as for the subsample without 
missing values: dimensionality reduction using PCA followed 
by k-means clustering. The main difference is the need to fill in 
the missing values in the model dataset (Figure 7). We tested 
several approaches to solve this problem: imputation with the 
median values, ML regression models, namely XGBoost (Chen 
& Guestrin 2016) and scikit-learn (Pedregosa et al. 2011) 
implementations of Random Forest and Histogram-based 
Gradient Boosting, and finally the imputation of the missing 


Research in Astronomy and Astrophysics, 24:055011 (23pp), 2024 May 


values using PPCA (Tipping & Bishop 1999), which showed 
the best results. 

The method introduces a latent variable z, corresponding to 
an M-dimensional PCA subspace, and probability distributions 
p such that 


pP@=NO,D, pix) = NWz + u, oD, (7) 


where x is the observed variable (D-dimensional object vector 
in the feature space) and M(O, I) is the standard normal 
distribution. The matrix W € R?*™”, vector u (equal to zero 
after scaling), and constant o? determines the PCA transforma- 
tion. PPCA reduces to classical PCA at o?=0. Such 
mathematical formalism allows one to determine W, jz, and 
g” via the expectation-maximization (EM) algorithm and 
impute missing values by sampling from latent p(z). In our 
case we have used an implementation of PPCA that calculates 
the imputed values along with W and o° (Porta et al. 2005), 
which is considered by the authors as a more efficient 
approach. 

This final variant (PPCA) for the imputation of missing 
values was chosen based on the maximum similarity of feature 
distributions in the obtained clusters for the two samples: (1) 
when all missing values had been dropped and (2) with 
imputed values. The clustering results of the sample without 
missing values were used as reference cluster labels. 

The similarity of the distributions was estimated by the 
Kullback—Leibler divergence (Kullback & Leibler 1951) 
calculated in our case as a sum over all feature distributions 
per each cluster 


P(x) 
Ox)’ 


Dx (P||Q) = D797 >) P@)log 


D K xe 


(8) 


where D are the features in the model dataset, K are the 
clusters, and P(x) and Q(x) are the probability distributions on 
the sample space X, with P(x) for the full sample and Q(x) for 
the reference subsample without missing values. 

Additionally we evaluated the Rand index between the two 
clusterings: R ~ 90%, which is quite good, given the increase 
in the number of objects by about four times and the fact that 
cluster boundaries are drawn in a continuous “cloud” without 
clear localization of the groups. 

The visual comparison of the model feature distributions in 
the subsample without missing values and in the full sample is 
shown in Figure 10. We can see that the distributions retain 
their shape well after the substantial increase of the sample. 
Note also that the observed discrepancies for some parameters 
(e.g., a substantial difference in the median log Vpeak values for 
cluster 0) should not be treated as errors, as the membership of 
new objects in the clusters is defined based on the entire 
number of features, and some changes in the shape of the 
distributions are expected after increasing the sample by several 
times. For instance, the strongest observed difference in the 
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log Vpeak distributions is in good agreement with the lowest 
importance of this feature for the clustering (see Table 3). 

We also performed the clustering of the full dataset by the 
SOM method, which resulted in a Rand index of ~0.92 with 
respect to the PCA-+k-means technique. Thus, the two 
clustering methods showed a 92% similarity for the complete 
dataset along with even better concordance for the smaller 
dataset without missing values (95%). The similar results also 
affirm that there are no nonlinearities in the data distribution, 
which could not be taken into account by the PCA+k-means 
method. Therefore, PCA+k-means has been used for further 
analysis, as a more straightforward approach. 

The totality of the results obtained allows us to perform the 
cluster analysis not only for the subsample without missing 
values but for all blazars in the Roma-BZCAT catalog. The 
flowchart of the clustering stages is shown in Figure 11, the 
final result for the full sample is highlighted in the lower right 
corner of the figure. 


7.2.2. Classification of Outliers 


To include all the Roma-BZCAT blazars in the clustering 
results, the membership of the 34 outliers filtered out at the data 
cleansing stage (Section 5.3) must be determined. 

We used the KNN classifier from the scikit-learn 
library (Pedregosa et al. 2011) to complete the task. The dataset 
with the obtained cluster labels, which acted as a target variable 
for training the classifier, was divided into the training and test 
samples in a ratio of 0.1. The hyperparameters were optimized 
by 5-fold cross-validation on the training sample using a grid 
search: the number of nearest neighbors in the [5, 100] range 
without weighting and with distance-based weights. The 
quality metric was the Fl-score. The final hyperparameters, 
70 nearest neighbors with weights inversely proportional to the 
distance, gave an Fl-score of ~0.94 on the test sample: the 
harmonic mean of the precision and recall for the trained 
classifier reaches 94%. 

For the trained classifier to work correctly, all the stages of 
data preprocessing for the outliers must be performed in the 
same way as for the training dataset, but here we could not use 
PPCA to replace missing values since the PPCA implementa- 
tion we used computed them along with PCA transformation 
parameters, which must be fixed during the inference. For that 
reason, here we used the following approach: 


1. instead of the first PPCA step (for the multicollinear flux 
densities transformed to metafeatures), we took the mean 
value over the corresponding flux densities for each 
object; 

2. for the second PPCA step, the missing values in the 
model dataset were imputed as the mean values over a 
column (which is actually zero after the scaling); 

3. other transformations (scaling, traditional PCA, etc.) 
corresponded to the main clustering model. 
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Figure 10. Comparison of the model dataset feature distributions for the full Roma-BZCAT sample (upper parts in the panels) and the subsample without missing 
values (lower parts in the panels). The distributions are shown as boxplots. A box is the interquartile range (IQR, 25th-75th percentile, or Q1—Q3) of a parameter 
distribution, a median is shown as a vertical line inside the box, “whiskers” extend to show the rest of the distribution, except for points that are determined as 


“outliers,” locating beyond the median 4 


t1.5 IQR range; these outliers are shown by dots. The panels correspond to the features, and the clusters are marked with 


different colors (see the legend in the first panel). The full sample contains 3527 blazars (we do not consider here the 34 outliers mentioned in Section 5.3), while the 
sample with all the missing values dropped amounts to only 858 objects, a quarter of the full sample. Although we imputed the missing values to perform the 
clustering for the full sample, these distributions are based on only the real data in both cases. 
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Figure 11. The flowchart of the performed clustering stages. 


7.3. Robustness of the Clustering to Data set 
Incompleteness and Feature Selection 


Two conditions that can influence the obtained results are the 
incompleteness of the Roma-BZCAT sample and the features 
we selected to perform the clustering. The former can change 
the boundaries of the clusters after taking a sufficient enough 
amount of new blazars, and the latter could form a new feature 
space with additional information about the objects. Particu- 
larly, in our clustering dataset we did not take into account the 
characteristics connected with the gamma-ray emission in the 
gamma-ray ranges, but blazars emit a large amount of their 
radiation in gamma-rays, which means that the gamma-ray 
band should carry important information about the sources. 

The gamma-ray measurements, though, are too scarce to be 
used in the clustering of the whole Roma-BZCAT blazars: the 
Fermi-LAT gamma-ray flux in the catalog is given for only 
28% of the sources. The new Fermi-LAT measurements (Ajello 
et al. 2022) do not fundamentally change the situation: we 
evaluated that now 44% of the objects would have gamma-ray 
fluxes, which is still insufficient (over 60%-—70% of available 
data are needed.) 

Nevertheless, using the present gamma-ray data, we can still 
evaluate the degree to which the use of gamma-ray measure- 
ments is able to change the obtained clustering results as well 
as evaluate the influence of dataset incompleteness. To this end, 
we took only the objects with available gamma-ray measure- 
ments and calculated for this subsample additional gamma-ray 
features analogously to our previously described data prep- 
aration. The added features are gamma-ray flux, luminosity, 
and hardness ratios relative to other spectral ranges, a total of 
seven new features to complement the 14 already available. 
After dropping the missing values for all the 21 features, we 
end up with a small dataset of 396 sources. Thus, we, first, 
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shorten the list of objects to as few as 11% of the complete 
sample and, second, add 50% new features with sufficiently 
different information concerning the gamma-ray range. To 
separate the influence of the two effects, we (1) compared the 
results of the clustering performed with 14 original features on 
the small dataset and on the complete sample; (2) compared the 
results of the clustering performed with 14 original features on 
the small dataset and the results obtained on the same dataset 
with 21 features. The Rand indices for these two comparisons 
are 0.85 and 0.80, respectively; i.e., a ~90% incompleteness of 
the sample could change the result by about 1—0.85 = 15%, 
while the addition of 50% new features preserves it at a level of 
about 80%. 

From the above evaluations, we can state that clustering 
labels for the sources should stay the same within 80% of the 
current clustering results if new sufficient data on gamma-ray 
fluxes are going to be available in the future. We also evaluated 
the importance of the features if taken along with the gamma- 
ray range, the result is presented in Table 4. As one can see, the 
gamma-ray features occupy the upper rows of the table, thus 
demonstrating that the gamma-ray range is important for blazar 
classification, and new more abundant measurements would 
lead to better, more accurate, results. 


8. Properties of the Clusters 
8.1. Comparison with the Known Classes 


First of all, it is interesting to compare our clusters with the 
known types of blazars. Here we make such a comparison for 
the Roma-BZCAT blazar types, for the HSP blazars from the 
3HSP catalog, and for the blazars detected in the TeV energy 
range from the TeVCat catalog. 

In the Roma-BZCAT catalog, blazars are divided into the 
following subtypes. 


BZB BL Lac objects and BL Lac candidates, which are AGNs 
with a featureless optical spectrum or having only 
absorption lines of the host galaxy origin and weak 
narrow emission lines; 


BZG sources usually reported in the literature as BL Lac 
objects but having SEDs with significant dominance of 
host galaxy emission; 


BZQ FSRQs with the optical spectrum showing broad 
emission lines and dominant blazar characteristics; 


BZU blazars of an uncertain type, a small number of sources 
having peculiar characteristics but also exhibiting blazar 
activity: the occasional presence/absence of broad 
emission lines or other features, transition between a 
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Table 5 
Cross Identification with the Roma-BZCAT Classes 


BZCAT Classes 


Clust. BL Lac BZG BL Lac FSRQ Uncert. Total 
Cand. 
0 480 122 55 12 14 683 
1 339 141 14 91 64 649 
2 173 10 11 403 70 667 
3 50 1 8 602 49 710 
4 17 0 4 801 30 852 
Total 1059 274 92 1909 227 3561 


Note. BZGs are the galaxy-dominated BL Lacs. 


radio galaxy and a BL Lac, galaxies hosting a low 
luminosity blazar nucleus, etc. 


In Table 5 and Figure 12 we compare the population of the 
obtained clusters with the subtypes of blazars in Roma- 
BZCAT. The vast majority of BL Lacs and BZGs fall into 
clusters 0 and 1. Clusters 3 and 4, on the contrary, are 
dominated by FSRQs. Cluster 2 is a mixture of BL Lacs and 
FSRQs. Blazars of an uncertain type avoid cluster 0 and are 
less present in clusters 3 and 4. It is noteworthy that the largest 
number of them are in cluster 2, a mixture of BL Lacs and 
FSRQs, although a comparable number is found in cluster 1. 

Judging by the quality metrics obtained earlier, we assign 
blazars to a particular cluster with an accuracy of about 90%, 
therefore a small number of blazars of the “opposite” types in 
individual clusters, with the exception of cluster 2, can be 
considered expected. Taking into consideration the correlated 
continuous decrease/increase of the number of BL Lacs and 
FSRQs among the clusters, it also could be a real effect to some 
degree. In total, we can state that the clustering results largely 
correlate with the classification of blazars in the Roma-BZCAT 
catalog. At the same time, our clustering additionally 
distinguishes between two subclasses of BL Lacs (clusters 0 
and 1) and two subclasses of FSRQs (clusters 3 and 4). There is 
also no division into BL Lacs and galaxy-dominated BL Lacs, 
although the almost complete absence of the latter in the 
“mixed” cluster 2 could be noted. 

Blazars are classified into a separate type of AGNs since they 
have a distinct orientation of the jet, pointing toward the 
observer at a small angle. As well as other AGNs, they have 
similar structure (e.g., a supermassive black hole, an accretion 
disk, a jet, etc.) and similar processes (the collimation of the jet, 
acceleration of electrons in a magnetic field, accretion of matter 
onto the central object), but these processes occur under 
different physical conditions, which causes the division of 
blazars into different subclasses according to the observed 
parameters. Thus, FSRQs have strong emission lines and 
higher luminosity compared to BL Lacs in almost all frequency 
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ranges, which is probably related to the more abundant fueling 
matter and, consequently, different accretion modes. The fact 
that different blazar types are not isolated in our clusters but 
demonstrate a continuous per-cluster distribution validates the 
commonly accepted uniformity of blazars’ nature. Note also 
that although we intentionally avoided any predetermined 
categorical separation such as the presence or absence of 
emission lines, the clustering correlates with the BL Lac/FSRQ 
classification, thus demonstrating that this difference in 
physical conditions can be obtained from other characteristics. 

In Figures 13 and 14 we demonstrate how the HSP blazars 
from the 3HSP catalog (Chang et al. 2019) and the blazars 
detected in the TeV energy range from the TeVCat catalog 
(Wakely & Horan 2008) are distributed within our clusters. 
Figure 14 clearly shows that almost all HSP blazars are 
members of cluster 0, and the rest of them, located in cluster 1, 
lie closer to the boundary of the two clusters. The TeV blazars 
are not that concentrated in a particular cluster, but have a 
tendency to be more abundant in BL Lac-populated clusters 
0-1 than in FSRQ-populated clusters 3—4. The overall number 
of TeV blazars is small, and there are only few of them found in 
the latter clusters, so their presence in the FSRQ-populated 
clusters is questionable and may be caused by clustering 
inaccuracy; at the same time the descent of the number of TeV 
blazars from cluster 0 to cluster 4 in Figure 13 looks pretty 
smooth. 


8.2. Description of the Clusters 


To investigate the properties of the selected groups, we 
analyzed the differences in the distributions of blazar 
characteristics in the obtained clusters. Partially, these 
differences are demonstrated in Figure 10. Some statistics of 
the distributions are given in Table 6. Additionally, in 
Figure 15 the luminosities or absolute magnitudes for different 
spectral ranges are given. The values should be used with 
caution: while the radio luminosities are corrected for different 
redshifts using radio spectral indices, the other values are not. 
Therefore, for instance, for high-redshift blazars the optical 
absolute magnitude in the i filter actually corresponds to the 
UV range in the source’s frame of reference. The effect is 
strong for the IR and optical ranges, but is less significant for 
X-rays and gamma-ray luminosities as they are measured in 
broad bands and anyway stay within the corresponding 
electromagnetic ranges at any redshift, drifting, though, to 
higher frequencies. A better understanding of the ratios 
between flux densities in different ranges of the electro- 
magnetic spectrum without the z-dependent bias may be 
obtained from Figure 16, which shows the rest-frame average 
SEDs for the resulting clusters. 

To construct the average SEDs, we normalized the SEDs of 
individual blazars by the measured synchrotron peak flux 
density (Section 4, Figure 3) and averaged the flux densities in 
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Figure 12. Cross identification with the Roma-BZCAT classes. Each panel corresponds to a certain blazar type and shows the number of blazars of this type within the 
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Figure 13. Cross identification with the HSP blazars from the 3HSP catalog 
and the blazars detected in the TeV energy range from the TeVCat catalog. The 
panels are organized analogously to Figure 12. 


50 bins. The error bars in Figure 16 correspond to the standard 
deviation within the bins. To demonstrate the difference in 
luminosities, the normalized spectra are additionally adjusted to 
the radio luminosity at a frequency of 5 GHz (log,, L5). The 
SEDs are recalculated to the rest frame. 

To better visualize the difference between cluster statistics 
demonstrated in Figures 10 and 15, and Table 6, we also 
constructed polar diagrams shown in Figure 17. The figure 
reflects the difference between median values of various 
characteristics: the polar diagrams are scaled in such a way 
that the maximum observed medians over all clusters 
correspond to values of | (the outer edges of the circles), 
while the minimum medians correspond to zero values (the 
centers of the circles). Each “azimuth” corresponds to a 
particular characteristic. 

Before describing the average SEDs, we first should mention 
the following. The SED of a blazar has typically a shape of two 
humps and constitutes a complex mix of emission from 
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different parts of an AGN. The first hump, extending from 
radio waves to X-rays, is believed to be formed by the 
synchrotron radiation in the jet. Part of the photons emitted in 
this process may experience synchrotron self-Compton scatter- 
ing (e.g., Bloom & Marscher 1996), contributing to the second, 
gamma-ray, hump. Photons from other AGN components such 
as the accretion disk, dust torus, and broad emission line clouds 
also contribute to the gamma-ray hump via inverse Compton 
scattering. The location of this gamma-ray emission is still a 
subject of research (e.g., Rani et al. 2016). Additionally, the 
accretion disk emits its own thermal radiation peaked in the 
optical-UV ranges, while the dust torus adds to the IR 
emission. The corona of the accretion disk can also scatter 
photons up to X-ray energies. At last, the SED may be affected 
by the host galaxy. 

All this complexity results in that a detailed description can 
only be made for a particular SED via complex modeling and/ 
or analysis of its variability time series for different ranges of 
the electromagnetic spectrum. Nevertheless, in the case of 
blazars, we have an advantage that their jets are inclined with 
respect to the line of sight at a small angle, therefore we can 
expect that the differences in the average SEDs are caused to a 
greater degree not by the geometric effects but by the different 
physical conditions in the AGN, whatever these conditions are. 


8.2.1. Clusters 0 and 1: BL Lac Subclasses 


These two clusters consist of BL Lacs and galaxy-dominated 
BL Lacs located at relatively small distances (up to 3 Gpc, 
z<0.9). The percentage of FSRQs is only 2% and 
14% respectively, see Table 5. Cluster blazars are distinguished 
by relatively reduced luminosity in the entire range of the 
electromagnetic spectrum (Table 6, Figures 15 and 16). Also, 
they have significantly reduced radio hardness parameters 
(Table 6, Figure 10). As well, the objects have lower gamma- 
ray luminosities (the same table and figures). 
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Figure 14. HSP and TeV blazars on the t-SNE cluster map. The points correspond to individual blazars. Different clusters are shown by different colors/symbols. The 
bigger symbols correspond to the HSP (left panel) and TeV (right panel) blazars. 


Table 6 
Cluster Characteristics for the Whole Roma-BZCAT Catalog 


Medians (Min—Max Values) 


logig Vpeak» [Hz] 
(Synchrotron Peak) 


logo Ls, [W/Hz] 


logio (vF 4/VFw2) 
(Radio/IR) 


(Radio Luminosity) 


14.5 (11.7, 18.4) 
13.6 (11.7, 15.6) 
13.1 (11.3, 15.5) 
13.4 (11.5, 16.3) 
12.9 (10.4, 15.5) 


24.8 (22.8, 27.4) 
25.2 (22.0, 27.6) 
27.0 (25.0, 29.1) 
26.8 (25.1, 29.3) 
27.6 (26.1, 29.7) 


—2.2 (—3.2, —0.3) 
—2.0 (3.8, —0.9) 
—1.2 (—3.6, +0.4) 
—1.4 (—3.0, —0.2) 
—0.6 (-1.7, +1.5) 


logjo(vFi 4/vFyuv) 


logio YF 4/VE) 

(radio /Ņ-rays) 
—4.1 (—5.2, —3.1) 
-3.7 (5.0, —2.2) 
-3.2 (4.5, —2.1) 
—3.3 (—4.4, —1.8) 
—3.0 (—4.5, —1.8) 


logio(vFw2/VF) 
(IR/optical) 
—0.3 (—1.4, +0.7) 
—0.1 (—1.3, +1.1) 
+0.3 (—0.4, +1.8) 
—0.2 (—1.2, +0.7) 
—0.3 (1.4, +0.7) 


logy )(VFw2/VFNuv) 
(IR/UV) 
—0.2 (—1.0, +1.0) 
+0.2 (—0.6, +1.5) 
+0.5 (—0.5, +2.4) 
—0.4 (—1.8, +0.5) 


—0.1 (—1.0, +1.1) 


Rad.-opt. d, Mpc 
Cluster N sp. Index 

0 (BL Lacs) 683 0.4 (—0.4, +0.8) 1200 (120, 4970) 
1 (BL Lacs) 649 0.4 (—0.3, +0.8) 1050 (10, 5750) 
2 (mix) 667 0.7 (+0.3, +1.2) 3050 (270, 8690) 
3 (FSRQs) 710 0.6 (+0.2, +0.9) 3390 (80, 7670) 
4 (FSRQs) 852 0.7 (+0.5, +1.2) 5010 (1660, 8140) 

logio (VFI 4/VFx) 

(radio/UV) (radio /X-rays) 
0 (BL Lacs) 683 —2.4 (—3.7, —0.4) —2.8 (—4.4, —1.4) 
1 (BL Lacs) 649 —1.8 (—3.5, —0.2) —1.6 (—3.4, —0.2) 
2 (mix) 667 —0.7 (—2.0, +1.1) —0.6 (—1.9, +1.5) 
3 (FSRQs) 710 —1.7 (—4.0, —0.8) —1.0 (—2.4, +0.2) 
4 (FSRQs) 852 —0.8 (—1.6, +0.9) —0.6 (—2.1, +1.0) 

log, 9(VFw2/VFx) log, )(VFw2/VE,) 

(IR/X) (IR/y-rays) 

0 (BL Lacs) 683 —0.6 (—1.9, +0.8) —1.8 (—3.1, —1.1) 
1 (BL Lacs) 649 +0.5 (—0.7, +2.7) —1.6 (—2.7, —0.1) 
2 (mix) 667 +0.6 (—0.5, +2.5) —1.8 (—3.4, —0.8) 
3 (FSRQs) 710 +0.3 (—0.9, +2.0) —1.8 (—3.0, —0.7) 
4 (FSRQs) 852 —2.2 (—3.6, —1.4) 


log, (VF /VE,) 


logig Lx, [W] 


logy) L}, [ph/s] 


+0.1 (—1.3, +1.0) 


Cluster O has the most prominent characteristics. From 
Figure 16 we can see that both the synchrotron and gamma-ray 
humps are factually not visible in the average SED. By 
reviewing some of the individual SEDs in the cluster, we have 
found that actually they have a standard shape of two humps. 
Therefore, this effect in the average SED is caused by the broad 
distribution of synchrotron peaks, seen in Figure 10. 

The average SED of cluster 1 has the classical shape of two 
humps, nevertheless with both humps not that prominent. The 
difference in shape with cluster 0 is likely caused by a more 
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(optical /-rays) (X-ray lumin.) (y-ray lumin.) 
—1.4 (2.9, —0.7) 37.6 (35.6, 39.4) 47.1 (45.5, 48.3) 
—1.5 (—2.9, +0.2) 36.9 (33.5, 39.1) 47.3 (45.0, 50.1) 
—2.2 (—4.0, —0.6) 38.0 (35.7, 39.9) 49.0 (46.7, 51.0) 
—1.9 (—3.1, —0.6) 38.2 (36.3, 40.1) 48.6 (46.6, 50.4) 
—2.1 (—3.7, -1.3) 38.7 (37.4, 40.3) 49.2 (47.5, 50.4) 

compact distribution of synchrotron peak frequencies 


(Figure 10). 

This broad variation in the synchrotron peak frequency, 
especially noticeable in cluster 0 (see Figure 10), in the entire 
range of logy ~ 11.7-18.4, is a characteristic peculiarity of the 
clusters, in other clusters the distributions are more compact, 
and high frequencies of the synchrotron peak are practically not 
found. One can notice from Figure 10 that almost all HSP 
blazars should belong to cluster 0. Earlier in Section 8.1 we 
considered the distribution of sources from the 3HSP catalog 
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Figure 15. The distributions of luminosities or absolute magnitudes in different ranges of the electromagnetic spectrum. The boxplots are constructed analogously to 
Figure 10. Each panel corresponds to a particular characteristic, and the distributions within the clusters are shown by different colors along the y-axes. Warning: the 
radio luminosities are corrected for different redshifts using radio spectral indices, the other values are not. 


(Chang et al. 2019) across our clusters and confirmed that 529 
of 657 3HSP blazars presented in Roma-BZCAT belong to 
cluster 0, and 118 to cluster 1, while only 10 are found in 
clusters 2, 3, and 4 (see Figure 13). Note also that the 3HSP 
blazars from cluster 1 are located close to the boundary of 
clusters 0 and 1. 

Note that within z < 0.9, where the blazars of clusters 0 and 1 
are located, the actual frequency emitted by a blazar becomes 
higher for large z, but all the frequencies used in the clustering 
feature space remain within their electromagnetic bands: 
3.35 um => 1.76 um (IR), 7520 A => 3960 A (optics), 2316 A => 
1220 A (UV) for the most distant objects. The radio, UV, X-ray, 
and gamma-ray frequencies remain within their bands for all z 
considered in this paper. For the average SEDs, the flux densities 
were transformed to the rest frame beforehand. 


8.2.2. Cluster 2: BL Lac-FSRQ Mix 


Cluster 2 is represented by a mixture of BL Lac (29% with 
BL Lac candidates) and FSRQ (60%) blazars, and the 
remaining 11% of the objects are of an uncertain type 
(Table 5). It practically does not contain galaxy-dominated 
BL Lacs, a small number of them (10 objects) can be attributed 
to clustering errors. 

In contrast to clusters 0 and 1, in cluster 2 we observe high 
radio, X-ray, and gamma-ray luminosities as well as bright 
absolute IR magnitudes comparable to other FSRQs 
(Figures 15 and 16). The absolute magnitudes in the optical 
and UV ranges are somewhat weakened compared to other 
FSRQs. In this connection, it is of interest to evaluate if the BL 
Lacs in cluster 2 are anyhow different from those in clusters 0 
and | or the statistically higher luminosities are only related to 
the presence of FSRQs. In Figure 18 we compare the radio 
luminosity distributions for the BL Lacs, and the observed 
difference clearly testifies that BL Lacs from cluster 2 are a 
special BL Lac subclass with high luminosity. 
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The average SED of the cluster in Figure 16 shows the most 
smooth shape of classical two humps. The distributions and 
Statistics in Figures 10 and 15, and Table 6 naturally follow 
from this shape. 

Within the observed redshifts for cluster blazars, 
z~ 0.05-2.5; the IR radiation, when converted to the rest 
frame of the source, remains generally within the IR range, 
approaching optical wavelengths for the most distant objects: 
3.35 um = 0.96 um; optical radiation moves into the UV range 
with growing distance: 7520 A= 2150A; UV radiation 
becomes harder: 2316 A = 660 A. Again, the average SEDs in 
Figure 16 are calculated in the rest frame and do not suffer from 
differing redshifts. 


8.2.3. Clusters 3 and 4: FSRQ Subclasses 


Clusters 3 and 4 are populated by FSRQs: 85% and 94%, 
respectively, or 91% and 97% if we exclude blazars of an 
uncertain type (see Table 5). Blazars from these clusters have 
high luminosities in the entire frequency range. The main 
difference between clusters 3 and 4, and the upper described 
cluster 2, as can be seen from the average spectra in Figure 16, 
is the degree of irregularities in the two-hump SED shape. 
While in cluster 2 we observe a smooth spectrum, in cluster 3 
the synchrotron hump becomes somewhat flattened due to 
enhanced emission at frequencies logov > 15 in the rest 
frame, and in cluster 4 the average SED obtains a step-like 
shape. The statistical characteristics in Figures 10 and 15, and 
Table 6 reflect the shape of the average SEDs. The source of 
the observed irregularities may be excessive emission from the 
central parts of AGNs. 

Blazars from cluster 3 have sufficiently lower radio hardness 
parameters (Figure 10). Cluster 4 demonstrates noticeably 
higher median luminosities in the radio, X-ray, and gamma-ray 
ranges. Statistically, this cluster contains the most luminous 
objects at higher redshifts, if we do not consider few individual 
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Figure 16. Average SEDs for the clusters in the rest frame. The frequency range is from radio waves (lower abscissa values) to gamma-ray emission. The y-axis is 
scaled in such a way that the median luminosity log,) L5 corresponds to the 5 GHz flux density of the average SED. The SEDs are fitted using 7th-degree polynomials 
(the orange lines). 


objects in cluster 2 with the greatest redshift and gamma-ray the scientific scope of the problem; the algorithm implemented 
luminosity. to find the clusters; and the number of clusters, a trade-off 
Distances are from 500 to 6500 Mpc (z~ 0.05-2.5) for between uniformity and individuality. 

cluster 3 and from 2000 to 8000 Mpc (z ~ 0.4—4) for cluster 4. Here for the feature space we used the maximum available 
For z= 4 the frequency shifts in the rest frame are as follows: number of characteristics that could be related to the properties 
IR radiation goes into the optical range 3.35 wm = 6700 A; of the objects. Such an approach helps describe blazars in the 
optical radiation goes into the UV range: 7520 A= 1500 A; most complete way possible. The process of feature selection is 
UV radiation becomes harder: 2316 A = 460A. described in detail in Section 5. In total, the model feature 
space comprises 14 continuously distributed characteristics. 

9. Summary Although adding new parameters (for instance, because of the 


growing number of observations) will inevitably change the 
clustering results for particular objects, especially on the cluster 
boundaries, the overall patterns observed in the sample will be 
preserved, unless the amount of new characteristics is 
comparable to the number of those in the original feature 
space. This allows us to discuss with a certain robustness the 
sample properties as a whole, which is confirmed by the 
comparison with the known blazar classifications. Cluster 
membership of any particular objects, though, must be 
considered with great caution and additional analysis. More- 
over, as no localized groups are revealed in the feature space 
and due to the incompleteness of the Roma-BZCAT catalog, 
boundaries between the clusters are only conditional and might 
change for a more complete sample. We evaluated the 
influence of sample incompleteness and feature selection on 
the clustering results in Section 7.3 and expect that our cluster 


In this paper we discuss the applications of the cluster 
analysis technique to the multiwavelength properties of the 
blazars from the Roma-BZCAT catalog. We divided the blazars 
into five groups and compared them with the Roma-BZCAT 
classification, HSP blazars from the 3HSP catalog and TeV 
blazars from TeVCat. We have found similar trends in blazar 
grouping, which confirms the effectiveness of the clustering 
technique. The obtained groups (clusters) are derived based on 
multiparametric distributions of blazar characteristics. 

To perform the project, we collected data from the radio to 
gamma-ray ranges both from the Roma-BZCAT catalog itself 
and from various other point-source catalogs, mostly those 
containing a sufficient amount of data for the sample. During 
clustering, the blazars were treated in the same manner 
regardless of the degree of our knowledge about them, e.g., 
we did not add any additional measurements from other 


catalogs for some well-known objects, thus preserving the labels will be preserved with a Rand index of 80% after adding 
homogeneous approach to the sample as a whole. a sufficiently large amount of new information. 

In general, clustering algorithms build an independent We have tested several clustering algorithms and finally 
unsupervised classification that is based almost solely on the settled on two of them: PCA+k-means and SOMs. The 
multiple properties of the objects under consideration. In this advantage of the latter is that SOMs can restore possible 
sense, clustering is a more uniform and homogeneous approach nonlinearities in a data distribution, while PCA dimensionality 
to classify cosmic objects based on the experimental data reduction is a more straightforward and interpretable method of 
avoiding subjective selection bias. The method, nevertheless, linear algebra. By showing the 90% similarity of their results, 
has its own hyperparameters (the parameters that are set by the we demonstrate the absence of nonlinearities in our data as well 
researcher rather than learned by the model from data): the as some robustness of feature space division into clusters: 
feature space, which determines the characteristics relevant for although the final stages use the k-means algorithm in both 
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Figure 17. Polar diagrams for scaled medians. Clusters 0—4 are shown from left to right and from top to bottom, also with different colors. The outer edges of the 
circles correspond to the highest median values, while the centers are the minimum median values. Each “azimuth” corresponds to a particular characteristic. 


cases, they work in sufficiently different spaces, a 14D space of 
neuron weights in the case of SOMs and a 6D space of PCA 
components for the other method. As the methods in our case 
have shown similar results, we followed PCA+k-means for 
interpretation clarity. 

The number of clusters is a hyperparamater that can be set 
rather loosely in the clustering problems. Generally, addition of 
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a new cluster leads to division of an existing one into two 
subclasses. In the case of a continuous data distribution, the 
boundaries of clusters may also vary. We chose the number of 
clusters to be five based on the best match between data 
distributions within the clusters obtained for the subsample 
without missing values and for the whole sample where the 
missing values were imputed via PPCA. 
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Figure 18. Difference in radio luminosity distributions of the low-luminosity 


BL Lacs in clusters 0-1 (blue color) and the high-luminosity BL Lacs in cluster 
2 (red color). The boxplots for the distributions are shown at the top. 


Speaking of the latter, we found it to be most effective for 
data imputation among other methods: imputation with 
medians or various ML regressions. Note that these imputed 
values were only used to perform the clustering for the 
complete sample, we did not use them for the statistical 
analysis of the derived clusters. 

The following notable characteristics of clusters have been 
derived. 


Cluster 0 

Consists of BL Lac-type blazars with low luminosities in all the 
ranges from radio to gamma-rays except X-ray emission. 
Almost all known HSP blazars fall into this cluster. The 
synchrotron hump is not visible in the average SED as well as 
the second hump in the gamma-ray range; this effect is caused 
by the broad distribution of synchrotron peak frequencies in the 
cluster. The cluster is characterized by low radio luminosity, 
radio hardness parameters (Table 6), and redshifts z < 0.9. 


Cluster 1 

Consists of BL Lac-type blazars with low luminosities in all the 
ranges from radio to gamma-rays. The average SED is of the 
usual shape with two humps, but the gamma-ray hump is 
weaker than usual and comparable by flux densities with the 
synchrotron one. These blazars have low radio luminosities and 
radio hardness parameters, and redshifts z < 0.9. 


Cluster 2 

A mix of BL Lac-type objects (29% including BL Lac 

candidates) and FSRQs (60%). The rest of the objects (11%) 

are of an uncertain type. These blazars have high luminosities, 

strong radio hardness parameters, and a clear smooth average 

SED with two humps. The redshifts span over the entire range. 
We show that the BL Lacs from this cluster form a special 
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subclass of high-luminosity BL Lacs compared to the low- 
luminosity population in clusters 0 and 1. 


Cluster 3 and 4 

FSRQs. The blazars have high luminosities. The clusters are 
primarily distinguished between each other by the degree of 
irregularities in the SED shape that may be caused by the 
influence of the emission from the AGN central parts. Blazars 
from cluster 3 demonstrate statistically lower radio hardness 
parameters compared to the FSRQs from clusters 2 and 4. 
Cluster 4 has noticeably higher median luminosities in the 
radio, X-ray, and gamma-ray ranges; actually this cluster 
contains, on average, the most luminous objects at higher 
redshifts, although individual record holders in terms of 
redshift and gamma-ray luminosity fall into cluster 2. 


Our results are consistent with the term “blazar sequence” 
that originated in Fossati et al. (1998) to describe the properties 
of blazar SEDs. The most well-known feature of this 
phenomenological sequence is the negative correlation between 
the synchrotron peak frequencies and synchrotron peak 
luminosities of the blazar population, i.e., HSP blazars having 
the lowest luminosities and the highest synchrotron peak 
frequencies and LSP blazars with the opposite characteristics. 
But in a later study by Nieppola et al. (2008), that anti- 
correlation was not found in the intrinsic blazar properties after 
correcting the observed data for the Doppler beaming effects, 
and another study by Giommi et al. (2012a) also showed that 
the originally reported anti-correlation was due to a selection 
effect. Following works based on various multiband data just 
confirmed that the emissions in blazars is strongly beamed and 
affects the observational phenomenon known as “blazar 
sequence” (e.g., Fan et al. 2017; Ouyang et al. 2023; Wan 
et al. 2024). In our study we did not apply the Doppler factor as 
one of the parameters because it is estimated for a limited 
number of blazars, e.g., for 979 Fermi blazars in one of the 
recent studies (Chen et al. 2024). Therefore, we cannot test the 
intrinsic nature of the blazar sequence. 

Keenan et al. (2021) studied a dichotomy in jets, dividing 
more than 2000 blazars into two samples: one with inefficient 
accretion weak/type I jets and the second with efficient strong / 
type II jets. The first group contained blazars with synchrotron 
peak frequencies above 10'° Hz (HSPs, nearly all BL Lacs), 
and the second one comprised mostly FSRQs and some LSP 
BL Lacs. This quite accurately coincides with our findings both 
by synchrotron peak frequency values and blazar type 
distribution, i.e., clusters 0 and 1 contain blazars with type I 
jets, and clusters 3 and 4 are the blazars with type H jets. 

We believe that these groups of Roma-BZCAT blazars, 
derived as a result of multiparametric analysis, can be used as 
additional information for further research, for example in the 
search for correlation with neutrino events or other statistical 
investigations. 
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The dataset with various characteristics of the blazars and 
cluster labels is available in the VizieR database. 
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